Developing Tools and Building Linguistic Resources for Vietnamese Morpho-syntactic Processing

Authors

  • Thanh Bon Nguyen
  • Nguyên Thi Minh Huyên
  • Laurent Romary
  • Xuân Luong Vu
Abstract

Vietnamese is spoken by about 80 million people around the world, yet very little concrete work on this language has been reported in Natural Language Processing (NLP) until now. The fundamental problems in automatic analysis of Vietnamese, such as part-of-speech (POS) tagging and parsing, are extremely difficult owing to the lack of formal linguistic knowledge on the one hand and the specificities of isolating languages on the other. In this paper we present our efforts to develop a set of tools for constructing and managing language resources for Vietnamese in a normalized framework, with the aim that these resources be widely distributed and usable for research purposes in NLP. We first define a tagset by constructing Vietnamese morpho-syntactic descriptors that fit a model compatible with MULTEXT, so as to allow for possible multilingual applications as well as the reuse of the defined tagsets. We then implement a system that performs word segmentation and POS tagging. Our system represents linguistic resources in a format currently under consideration within the framework of ISO TC37 SC4. Finally, we attempt to construct a formal syntactic description of nominal groups using the Tree Adjoining Grammar (TAG) formalism.


Similar resources

A lexicon for Vietnamese language processing

Only very recently have Vietnamese researchers begun to be involved in the domain of Natural Language Processing (NLP). As there does not exist any published work in formal linguistics nor any recognizable standard for Vietnamese word definition and word categories, the fundamental tasks for automatic Vietnamese language processing, such as part-of-speech tagging, parsing, etc., are very diffic...


Lexical descriptions for Vietnamese language processing

Only very recently have Vietnamese researchers begun to be involved in the domain of Natural Language Processing. As there does not exist any published work in formal linguistics or any recognizable standard for Vietnamese word categories, the fundamental works in Vietnamese text analysis such as part-of-speech tagging, parsing, etc. are very difficult tasks for computer scientists. All necessa...


Portable Language Technology: a Resource-light Approach to Morpho-syntactic Tagging

Morpho-syntactic tagging is the process of assigning part of speech (POS), case, number, gender, and other morphological information to each word in a corpus. Morpho-syntactic tagging is an important step in natural language processing. Corpora that have been morphologically tagged are very useful both for linguistic research, e.g. finding instances or frequencies of particular constructions in...


Standardisation and Interoperation of Morphosyntactic and Syntactic Annotation Tools for Spanish and their Annotations

Linguistic annotation tools and linguistic annotations are scarcely syntactically and/or semantically interoperable. Their low interoperability usually results from the number of factors taken into account in their development and design. These include (i) the type of phenomena annotated (either morphosyntactic, syntactic, semantic, etc.); (ii) how these phenomena are annotated (e.g., the parti...


Sixth International Joint Conference on Natural Language Processing Proceedings of the 11th Workshop on Asian Language Resources

Bilingual corpora play an important role as resources not only for machine translation research and development but also for studying tasks in comparative linguistics. Manual annotation of word alignments is of significance to provide a gold-standard for developing and evaluating machine translation models and comparative linguistics tasks. This paper presents research on building an English-Vi...



Publication date: 2004